Linear Models

Lecture 01

Dr. Colin Rundel

Linear Models

Linear Models Basics

Pretty much everything we are going to see in this course will fall under the umbrella of either linear or generalized linear models.

In previous classes most of your time has likely been spent on the simple iid case,

\[Y_i = \beta_0 + \beta_1 \, x_{i1} + \cdots + \beta_p \, x_{ip} + \epsilon_i \] \[ \epsilon_i \sim N(0, \sigma^2) \]

these models can also be expressed simply as,

\[ Y_i \overset{iid}{\sim} N(\beta_0 + \beta_1 \, x_{i1} + \cdots + \beta_p \, x_{ip},~ \sigma^2) \]

Linear model - matrix notation

We can also express this model using matrix notation as follows,

\[ \begin{aligned} \underset{n \times 1}{\boldsymbol{Y}} = \underset{n \times p}{\boldsymbol{X}} \, \underset{p \times 1}{\boldsymbol{\beta}} + \underset{n \times 1}{\boldsymbol{\epsilon}} \\ \underset{n \times 1}{\boldsymbol{\epsilon}} \sim N(\underset{n \times 1}{\boldsymbol{0}}, \; \sigma^2 \underset{n \times n}{\mathbb{1}_n}) \end{aligned} \]

or alternatively as,

\[ \underset{n \times 1}{\boldsymbol{Y}} \sim N\left(\underset{n \times p}{\boldsymbol{X}} \, \underset{p \times 1}{\boldsymbol{\beta}},~ \sigma^2 \underset{n \times n}{\mathbb{1}_n}\right) \]

Multivariate Normal Distribution - Review

An \(n\)-dimensional multivariate normal distribution with mean \(\boldsymbol{\mu}\) and covariance \(\boldsymbol{\Sigma}\) (positive semidefinite) can be written as

\[ \underset{n \times 1}{\boldsymbol{Y}} \sim N(\underset{n \times 1}{\boldsymbol{\mu}}, \; \underset{n \times n}{\boldsymbol{\Sigma}}) \text{ where } \{\boldsymbol{\Sigma}\}_{ij} = \rho_{ij} \sigma_i \sigma_j \]

\[ \begin{pmatrix} Y_1\\ Y_2\\ \vdots\\ Y_n \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_1\\ \mu_2\\ \vdots\\ \mu_n \end{pmatrix}, \, \begin{pmatrix} \rho_{11}\sigma_1\sigma_1 & \rho_{12}\sigma_1\sigma_2 & \cdots & \rho_{1n}\sigma_1\sigma_n \\ \rho_{21}\sigma_2\sigma_1 & \rho_{22}\sigma_2\sigma_2 & \cdots & \rho_{2n}\sigma_2\sigma_n\\ \vdots & \vdots & \ddots & \vdots \\ \rho_{n1}\sigma_n\sigma_1 & \rho_{n2}\sigma_n\sigma_2 & \cdots & \rho_{nn}\sigma_n\sigma_n \\ \end{pmatrix} \right) \]

Multivariate Normal Distribution - Density

For the \(n\)-dimensional multivariate normal given on the last slide, its density is given by

\[ (2\pi)^{-n/2} \; \det(\boldsymbol{\Sigma})^{-1/2} \; \exp{\left(-\frac{1}{2} \underset{1 \times n}{(\boldsymbol{Y}-\boldsymbol{\mu})'} \underset{n \times n}{\boldsymbol{\Sigma}^{-1}} \underset{n \times 1}{(\boldsymbol{Y}-\boldsymbol{\mu})}\right)} \]

and its log density is given by

\[ -\frac{n}{2} \log 2\pi - \frac{1}{2} \log \det(\boldsymbol{\Sigma}) - \frac{1}{2} \underset{1 \times n}{(\boldsymbol{Y}-\boldsymbol{\mu})'} \underset{n \times n}{\boldsymbol{\Sigma}^{-1}} \underset{n \times 1}{(\boldsymbol{Y}-\boldsymbol{\mu})} \]
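As a quick sanity check, the log density can be evaluated directly from the formula above and compared against `mvtnorm::dmvnorm`; a minimal sketch, assuming the mvtnorm package is available (the AR(1)-style covariance is just an arbitrary positive definite example):

library(mvtnorm)

set.seed(1)
n = 3
mu = rep(0, n)
Sigma = 0.5^abs(outer(1:n, 1:n, "-")) # arbitrary positive definite covariance
y = drop(rmvnorm(1, mu, Sigma))

# log density computed from the formula on this slide
ld = -n/2 * log(2*pi) - 1/2 * log(det(Sigma)) -
  1/2 * t(y - mu) %*% solve(Sigma) %*% (y - mu)

c(manual = drop(ld), dmvnorm = dmvnorm(y, mu, Sigma, log = TRUE))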

Maximum Likelihood - \(\boldsymbol{\beta}\) (iid)
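A sketch of the standard derivation: under the iid model the log likelihood is

\[ \ell(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2} \log 2\pi\sigma^2 - \frac{1}{2\sigma^2} (\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta})'(\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta}) \]

so maximizing over \(\boldsymbol{\beta}\) is equivalent to minimizing the sum of squared errors. Setting \(\partial \ell / \partial \boldsymbol{\beta} = \boldsymbol{0}\) gives the normal equations \(\boldsymbol{X}'\boldsymbol{X}\boldsymbol{\beta} = \boldsymbol{X}'\boldsymbol{Y}\), and hence

\[ \hat{\boldsymbol{\beta}}_{mle} = (\boldsymbol{X}'\boldsymbol{X})^{-1} \boldsymbol{X}' \boldsymbol{Y} = \hat{\boldsymbol{\beta}}_{ls} \]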

Maximum Likelihood - \(\sigma^2\) (iid)
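Similarly, differentiating \(\ell(\boldsymbol{\beta}, \sigma^2)\) with respect to \(\sigma^2\) and setting the result to zero gives

\[ \hat{\sigma}^2_{mle} = \frac{1}{n} (\boldsymbol{Y}-\boldsymbol{X}\hat{\boldsymbol{\beta}})'(\boldsymbol{Y}-\boldsymbol{X}\hat{\boldsymbol{\beta}}) \]

Note that this estimator is biased; the usual unbiased estimator replaces \(n\) with \(n-p\).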

A Quick Example

Parameters -> Synthetic Data

Let's generate some simulated data where the underlying model is known and see how various regression procedures perform.

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i \] \[ \beta_0 = 0.7, \quad \beta_1 = 1.5, \quad \beta_2 = -2.2, \quad \beta_3 = 0.1 \] \[ \epsilon_i \sim N(0,1) \]

library(dplyr)

set.seed(1234)
n = 100
beta = c(0.7, 1.5, -2.2, 0.1)
eps = rnorm(n)

d = tibble(
  X1 = rt(n, df = 5),
  X2 = rt(n, df = 5),
  X3 = rt(n, df = 5)
) %>%
  mutate(Y = beta[1] + beta[2]*X1 + beta[3]*X2 + beta[4]*X3 + eps)

Model Matrix

X = model.matrix(~X1+X2+X3, d) 
as_tibble(X)
# A tibble: 100 × 4
   `(Intercept)`      X1     X2     X3
           <dbl>   <dbl>  <dbl>  <dbl>
 1             1  1.26   -3.30   0.664
 2             1  1.17   -4.88  -2.20 
 3             1 -0.590  -1.08  -1.95 
 4             1  0.358   0.319 -0.582
 5             1  1.20   -0.314 -2.41 
 6             1  0.856  -0.733  0.274
 7             1  0.649  -0.385  0.112
 8             1 -1.58   -0.449 -1.89 
 9             1 -0.424   1.22  -0.193
10             1  0.0808  1.27  -0.488
# … with 90 more rows

Pairs plot

GGally::ggpairs(d, progress = FALSE)

Least squares fit

Let \(\hat{\boldsymbol{Y}}\) be our estimate for \(\boldsymbol{Y}\) based on our estimate of \(\boldsymbol{\beta}\), \[ \hat{\boldsymbol{Y}} = \hat{\beta}_0 + \hat{\beta}_1 \, \boldsymbol{X}_{1} + \hat{\beta}_2 \, \boldsymbol{X}_{2} + \hat{\beta}_3 \, \boldsymbol{X}_{3} = \boldsymbol{X}\, \hat{\boldsymbol{\beta}} \]

The least squares estimate, \(\hat{\boldsymbol{\beta}}_{ls}\), is given by \[ \underset{\boldsymbol{\beta}}{\text{argmin}} \sum_{i=1}^n \left( Y_i - \boldsymbol{X}_{i\cdot} \boldsymbol{\beta} \right)^2 \]

Previously we derived, \[ \hat{\boldsymbol{\beta}}_{ls} = (\boldsymbol{X}' \boldsymbol{X})^{-1} \boldsymbol{X}' \, \boldsymbol{Y} \]

Frequentist Fit

l = lm(Y ~ X1 + X2 + X3, data=d)
l$coefficients
(Intercept)          X1          X2          X3 
  0.6566561   1.4657537  -2.2807109   0.1629704 
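# direct least squares computation, solving (X'X) beta = X'Y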
(beta_hat = solve(t(X) %*% X, t(X)) %*% d$Y)
                  [,1]
(Intercept)  0.6566561
X1           1.4657537
X2          -2.2807109
X3           0.1629704

Bayesian model specification (JAGS)

model = 
"model{
  # Likelihood
  for(i in 1:length(Y)){
    Y[i] ~ dnorm(mu[i], tau)
    mu[i] = beta[1] + beta[2]*X1[i] + beta[3]*X2[i] + beta[4]*X3[i]
  }

  # Prior for beta
  for(j in 1:4){
    beta[j] ~ dnorm(0,1/100)
  }

  # Prior for the precision tau (sigma2 = 1/tau)
  tau ~ dgamma(1, 1)
  sigma2 = 1/tau
}"

Compiling

m = rjags::jags.model(
  textConnection(model), 
  data = d, n.chains = 2
)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 100
   Unobserved stochastic nodes: 5
   Total graph size: 810

Initializing model

Sampling
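The sampling code is not shown on the slide; a sketch consistent with the `samp` output below (the 5000 retained iterations starting at 1001 are inferred from the `mcpar` attribute, with the first 1000 iterations used for adaptation):

samp = rjags::coda.samples(
  m, variable.names = c("beta", "sigma2"),
  n.iter = 5000
)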

Results

str(samp)
List of 2
 $ : 'mcmc' num [1:5000, 1:5] 0.607 0.659 0.668 0.829 0.801 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:5] "beta[1]" "beta[2]" "beta[3]" "beta[4]" ...
  ..- attr(*, "mcpar")= num [1:3] 1001 6000 1
 $ : 'mcmc' num [1:5000, 1:5] 0.788 0.769 0.494 0.878 0.566 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:5] "beta[1]" "beta[2]" "beta[3]" "beta[4]" ...
  ..- attr(*, "mcpar")= num [1:3] 1001 6000 1
 - attr(*, "class")= chr "mcmc.list"

CODA & mcmc objects

head(samp[[1]])
Markov Chain Monte Carlo (MCMC) output:
Start = 1001 
End = 1007 
Thinning interval = 1 
       beta[1]  beta[2]   beta[3]   beta[4]   sigma2
[1,] 0.6073990 1.421458 -2.176067 0.1349446 0.993633
[2,] 0.6593313 1.463185 -2.446135 0.1737123 1.056151
[3,] 0.6680386 1.320887 -2.367218 0.3249702 1.193919
[4,] 0.8287267 1.492184 -2.126412 0.1882640 1.306094
[5,] 0.8008090 1.381371 -2.431426 0.1651416 1.126294
[6,] 0.7237356 1.316525 -2.532359 0.1659811 1.132194
[7,] 0.5368579 1.440870 -2.482804 0.2317600 1.107264

CODA & mcmc objects - plotting

plot(samp)

Tidy Bayes

# A tibble: 50,000 × 7
# Groups:   i, .variable [5]
       i .chain .iteration .draw .variable .value param  
   <int>  <int>      <int> <int> <chr>      <dbl> <chr>  
 1     1      1          1     1 beta       0.607 beta[1]
 2     1      1          2     2 beta       0.659 beta[1]
 3     1      1          3     3 beta       0.668 beta[1]
 4     1      1          4     4 beta       0.829 beta[1]
 5     1      1          5     5 beta       0.801 beta[1]
 6     1      1          6     6 beta       0.724 beta[1]
 7     1      1          7     7 beta       0.537 beta[1]
 8     1      1          8     8 beta       0.552 beta[1]
 9     1      1          9     9 beta       0.760 beta[1]
10     1      1         10    10 beta       0.651 beta[1]
# … with 49,990 more rows
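A tidy draws frame like the one above can be built with tidybayes; a sketch (the exact code is not shown on the slide, and the construction of the `param` column from `.variable` and the index `i` is an assumption):

df = tidybayes::gather_draws(samp, beta[i], sigma2) %>%
  mutate(
    param = ifelse(is.na(i), .variable, paste0(.variable, "[", i, "]"))
  )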

Tidy Bayes + ggplot - Traceplot
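A sketch, assuming the tidy frame `df` from the previous slide:

library(ggplot2)

df %>%
  ggplot(aes(x = .iteration, y = .value, color = as.factor(.chain))) +
  geom_line(alpha = 0.5) +
  facet_wrap(~param, scales = "free_y") +
  labs(color = "chain")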

Tidy Bayes + ggplot - Density plot
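Similarly, a sketch for the posterior densities:

df %>%
  ggplot(aes(x = .value, fill = as.factor(.chain))) +
  geom_density(alpha = 0.5) +
  facet_wrap(~param, scales = "free_x") +
  labs(fill = "chain")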

Comparing Approaches

# A tibble: 5 × 4
  param   truth    ols post_mean
  <chr>   <dbl>  <dbl>     <dbl>
1 beta[1]   0.7  0.657     0.656
2 beta[2]   1.5  1.47      1.46 
3 beta[3]  -2.2 -2.28     -2.28 
4 beta[4]   0.1  0.163     0.162
5 sigma2    1    1.08      1.14 

Comparing Approaches - code
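A sketch of code producing the table above (assuming `df`, `l`, and `beta` from earlier slides; the `ols` entry for `sigma2` is assumed to be the usual unbiased residual variance estimate):

comp = df %>%
  ungroup() %>%
  group_by(param) %>%
  summarize(post_mean = mean(.value)) %>%
  mutate(
    truth = c(beta, 1),
    ols   = c(unname(l$coefficients), summary(l)$sigma^2)
  ) %>%
  select(param, truth, ols, post_mean)
comp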

Comparing Approaches - plot
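One possible version of this plot, overlaying the true values and posterior means on the posterior densities (a sketch, assuming `df` and `comp` from the previous slides):

df %>%
  ggplot(aes(x = .value)) +
  geom_density(fill = "lightblue") +
  geom_vline(data = comp, aes(xintercept = truth), linetype = "dashed") +
  geom_vline(data = comp, aes(xintercept = post_mean), color = "red") +
  facet_wrap(~param, scales = "free")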

Bayesian Model

Likelihood:

\[ \boldsymbol{Y} \,|\, \boldsymbol{\beta}, \, \sigma^2 \sim N(\boldsymbol{X}\boldsymbol{\beta},\, \sigma^2 \, {\mathbb{1}_n}) \]

Priors: \[ \beta_i \sim N(0, \sigma^2_\beta) \text{ or } \boldsymbol{\beta} \sim N(\boldsymbol{0}, \sigma^2_\beta \, {\mathbb{1}_p}) \]

\[ \sigma^2 \sim \text{Inv-Gamma}(a,\,b) \]

Deriving the posterior

\[ \begin{aligned} \left[ \boldsymbol{\beta}, \sigma^2 \,|\, \boldsymbol{Y}, \boldsymbol{X} \right] &= \frac{[\boldsymbol{Y} \,|\, \boldsymbol{X}, \boldsymbol{\beta}, \sigma^2]}{[\boldsymbol{Y} \,|\, \boldsymbol{X}]} [\boldsymbol{\beta}, \sigma^2] \\ &\propto [\boldsymbol{Y} \,|\, \boldsymbol{X}, \boldsymbol{\beta}, \sigma^2] [\boldsymbol{\beta},\,\sigma^2] \\ &\propto [\boldsymbol{Y} \,|\, \boldsymbol{X}, \boldsymbol{\beta}, \sigma^2] [\boldsymbol{\beta}\,|\,\sigma^2] [\sigma^2] \end{aligned} \]

where,

\[ f(\boldsymbol{Y} \,|\, \boldsymbol{X}, \boldsymbol{\beta}, \sigma^2) = \left(2\pi \sigma^2\right)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} (\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta})'(\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta}) \right) \]

\[ f(\boldsymbol{\beta}\,|\, \sigma^2_\beta) = (2\pi \sigma^2_\beta)^{-p/2} \exp\left( -\frac{1}{2\sigma^2_\beta} \boldsymbol{\beta}'\boldsymbol{\beta} \right) \]

\[ f(\sigma^2 \,|\, a,\, b) = \frac{b^a}{\Gamma(a)} (\sigma^2)^{-a-1} \exp\left( -\frac{b}{\sigma^2} \right) \]

Deriving the Gibbs sampler (\(\sigma^2\) step)

\[ \begin{aligned} \left[ \boldsymbol{\beta}, \sigma^2 \,|\, \boldsymbol{Y}, \boldsymbol{X} \right] \propto &\left(2\pi \sigma^2\right)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} (\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta})'(\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta}) \right) \\ &(2\pi \sigma^2_\beta)^{-p/2} \exp\left( -\frac{1}{2\sigma^2_\beta} \boldsymbol{\beta}'\boldsymbol{\beta} \right) \\ &\frac{b^a}{\Gamma(a)} (\sigma^2)^{-a-1} \exp\left( -\frac{b}{\sigma^2} \right) \end{aligned} \]
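Collecting only the terms involving \(\sigma^2\),

\[ \left[\sigma^2 \,|\, \boldsymbol{\beta}, \boldsymbol{Y}, \boldsymbol{X}\right] \propto (\sigma^2)^{-(a + n/2) - 1} \exp\left( -\frac{1}{\sigma^2} \left( b + \frac{1}{2} (\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta})'(\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta}) \right) \right) \]

which is the kernel of an inverse-gamma distribution, so

\[ \sigma^2 \,|\, \boldsymbol{\beta}, \boldsymbol{Y}, \boldsymbol{X} \sim \text{Inv-Gamma}\left(a + \frac{n}{2},~ b + \frac{1}{2} (\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta})'(\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta})\right) \]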

Deriving the Gibbs sampler (\(\boldsymbol{\beta}\) step)

\[ \begin{aligned} \left[ \boldsymbol{\beta}, \sigma^2 \,|\, \boldsymbol{Y}, \boldsymbol{X} \right] \propto &\left(2\pi \sigma^2\right)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} (\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta})'(\boldsymbol{Y}-\boldsymbol{X}\boldsymbol{\beta}) \right) \\ &(2\pi \sigma^2_\beta)^{-p/2} \exp\left( -\frac{1}{2\sigma^2_\beta} \boldsymbol{\beta}'\boldsymbol{\beta} \right) \\ &\frac{b^a}{\Gamma(a)} (\sigma^2)^{-a-1} \exp\left( -\frac{b}{\sigma^2} \right) \end{aligned} \]
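Collecting the terms involving \(\boldsymbol{\beta}\) and completing the square,

\[ \left[\boldsymbol{\beta} \,|\, \sigma^2, \boldsymbol{Y}, \boldsymbol{X}\right] \propto \exp\left( -\frac{1}{2} \left( \boldsymbol{\beta}' \left( \frac{\boldsymbol{X}'\boldsymbol{X}}{\sigma^2} + \frac{\mathbb{1}_p}{\sigma^2_\beta} \right) \boldsymbol{\beta} - \frac{2}{\sigma^2} \, \boldsymbol{\beta}' \boldsymbol{X}'\boldsymbol{Y} \right) \right) \]

which is the kernel of a multivariate normal, so

\[ \boldsymbol{\beta} \,|\, \sigma^2, \boldsymbol{Y}, \boldsymbol{X} \sim N\left( \frac{1}{\sigma^2} \boldsymbol{V} \boldsymbol{X}'\boldsymbol{Y},~ \boldsymbol{V} \right) \text{ where } \boldsymbol{V} = \left( \frac{\boldsymbol{X}'\boldsymbol{X}}{\sigma^2} + \frac{\mathbb{1}_p}{\sigma^2_\beta} \right)^{-1} \]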